Abstract

Unsupervised domain adaptation seeks to learn an invariant and discriminative representation for an unlabeled target domain by leveraging the information of a labeled source dataset. We propose to improve the discriminative ability of the target domain representation by simultaneously learning tightly clustered target representations while encouraging that each cluster is assigned to a unique and different class from the source. This strategy alleviates the effects of negative transfer when combined with adversarial domain matching between source and target representations. Our approach is robust to differences in the source and target label distributions and thus applicable to both balanced and imbalanced domain adaptation tasks, and with a simple extension, it can also be used for partial domain adaptation. Experiments on several benchmark datasets for domain adaptation demonstrate that our approach can achieve state-of-the-art performance in all three scenarios, namely, balanced, imbalanced and partial domain adaptation.

1. Introduction 

Deep neural networks have demonstrated remarkable advancement in supervised learning for a wide variety of tasks in the past decade. However, training such models usually requires massive amounts of labeled data, which is prohibitive in some applications. Therefore, it is of interest to develop domain invariant classification models that are able to generalize to other (unlabeled) domains beyond that for which they were trained. Unsupervised domain adaptation is a general framework for learning domain invariant representations. The goal is to learn a shared latent representation (encoding) of (labeled) source and (unlabeled) target instances complemented with a classifier to accurately label instances using the latent representation as input. During learning, the differences between source and target representations are minimized at a population (distribution) level, while the discriminative ability of the classifier is maximized using only the labeled source data. Subsequently, the learned classifier and encoding can be used to label target instances without the need for manual labeling effort.

A typical application for this setting is image classification, where the instances are images, labels denote different image classes, the latent representations are often obtained via a convolutional encoder, and the source and target domains consist of instances of the same image classes but obtained under different technical conditions.

Provided there is no available labeled data for the target domain, existing unsupervised domain adaptation methods generally match the distributions of the source and target latent representations (features). This approach assumes that the source and target share the same label domain and distribution, i.e., same labels and comparable label prevalences. However, it cannot be guaranteed that the representation learned by distribution matching is discriminative, i.e., latent representations for different classes may not be well separated and may thus be difficult to classify. Fortunately, this matching
approach has been widely successful in practice, particularly, 
in image classification problems [ 1_8, 2G51- 

In real applications, the unknown target label distribution can be different from the source, i.e., labels in the target domain can be observed with different proportions compared to the source. Further, in the case of partial domain adaptation, the set of labels in the target domain can be a subset of that of the source. As illustrated in Figure 1, these scenarios are challenging because matching the distribution of the latent features for source and target domains is likely to result in negative transfer. This happens because distribution matching forces observations from the target to be placed near source observations whose label is not present in the target, thus negatively impacting the quality of the target encoder. As a result, the adapted model may sometimes be worse than that trained on the source, since the target representation has poor or no discriminative ability after adaptation.

In this paper, we propose an approach that extends the 
scope of unsupervised domain adaptation by relaxing the 
assumption of needing shared label domain and distributions. 
Consequently, our approach is robust to differences in source 
and target label distributions. Assuming that the source label 
distribution is uniform, which can be easily achieved by 
resampling, we consider the following three scenarios: 
Figure 1: Negative transfer in partial domain adaptation. Source and target features are colored orange and blue, respectively. Instances from prevalent classes in the target will be negatively transferred to smaller classes or those not present in the target, which results in poorly discriminative target features.


Balanced domain adaptation: This is the case when the target label distribution is balanced in relation to the source, which is the basic assumption of most previous methods [18, 9, 29], and can be addressed by distribution matching of the source and target features.

Imbalanced domain adaptation: This is the case when the target label proportions are substantially different from those of the source, and thus imbalanced. In this scenario, negative transfer is likely to occur, which will result in degraded performance for some classes, usually the most prevalent ones, as illustrated in Figure 1.

Partial domain adaptation: This recently studied scenario considers the case when the target label domain is a subset of the source. In some sense, it can be understood as an extreme case of class imbalance, in which some classes occur in the target with zero probability. However, these two scenarios need to be considered separately because, in terms of performance, they need to be evaluated differently. For instance, in imbalanced domain adaptation the overall classification accuracy is not necessarily meaningful or informative. Alternatively, one may consider class-wise accuracies, which may be a more appropriate performance metric.
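
To make the distinction between overall and class-wise accuracy concrete, the following is a minimal sketch in Python. The function name `classwise_accuracy` and the toy labels are illustrative assumptions and are not taken from the experiments.

```python
from collections import defaultdict

def classwise_accuracy(y_true, y_pred):
    """Return ({class: accuracy}, overall accuracy, mean class-wise accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(y_true)
    mean_class = sum(per_class.values()) / len(per_class)
    return per_class, overall, mean_class

# Toy imbalanced target (hypothetical): nine instances of class 0, one of class 1.
y_true = [0] * 9 + [1]
# A degenerate model that always predicts the prevalent class.
y_pred = [0] * 10
per_class, overall, mean_class = classwise_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is high even though the minority class is never predicted correctly, which is why class-wise accuracies can be more informative under imbalance.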

In this work, we propose a new approach that accommodates the three scenarios described above. Our key contribution is to improve the discriminative ability of the target latent representation by simultaneously i) learning tightly clustered target representations, ii) encouraging each cluster to be assigned to a different and unique class from the source, and iii) minimizing the discrepancies between source and target representations by distribution matching. We will show empirically that these three criteria largely alleviate the effects of negative transfer on imbalanced and partial domain adaptation tasks. Finally, our experiments demonstrate that our approach yields excellent results in all three scenarios.


2. Robust Unsupervised Adaptation 

Our approach extends the ability of current domain matching adaptation models to the imbalanced and partial settings. This is achieved by learning a tightly clustered target representation while encouraging that each cluster is assigned to a unique and different class from the source. These two criteria are combined with representation distribution matching as in Adversarial Discriminative Domain Adaptation (ADDA) [29], which will result in more discriminative and domain invariant target representations, as will be demonstrated in the experiments.

Assume we have a labeled source dataset, $\{X_s, Y_s\}_{i=1}^{N_s}$, where $X_s$ and $Y_s$ represent source inputs and labels, respectively, and $N_s$ is the number of observed pairs. Source labels, $Y_s \in \mathcal{Y}_s$, can take one of $K_s$ distinct labels with (marginal) probability $P(Y_s)$. We seek to leverage information in the source and a set of (unlabeled) target inputs, $\{X_t\}_{i=1}^{N_t}$ of size $N_t$, to make predictions about their labels, i.e., to obtain $Y_t$ for $X_t$. Similarly, $Y_t \in \mathcal{Y}_t$ can take one of $K_t$ distinct labels with (marginal) probability $P(Y_t)$. Here we not only consider the standard scenario, denoted as balanced domain adaptation, where $\mathcal{Y}_s = \mathcal{Y}_t$ and $P(Y_s) = P(Y_t)$, but also imbalanced domain adaptation, where $\mathcal{Y}_s = \mathcal{Y}_t$ but $P(Y_s) \neq P(Y_t)$, i.e., the sets of labels in the source and target domains are the same but observed in different proportions. Further, we also consider partial domain adaptation, where $\mathcal{Y}_t \subset \mathcal{Y}_s$, i.e., the target labels are a proper subset of the source labels, so $K_t < K_s$. This scenario can be seen as an extreme case of imbalanced domain adaptation where some labels in the target domain are observed with probability zero.

Our approach assumes that we can obtain a source encoder, $Z_s = E_s(X_s)$, and classifier, $p(Y_s = k \mid Z_s) = C(Z_s)$, for $k = 1, \ldots, K_s$, that can be trained on $\{X_s, Y_s\}_{i=1}^{N_s}$. To obtain latent features that are uninformative of the differences between source and target domains, thus in principle only containing information about the labels, we specify a discriminator, $p(Y_{domain} = s \mid Z_d) = D(Z_d)$, tasked to learn whether $Z_d$, for $d \in \{s, t\}$, is from the source or target, $Y_{domain} \in \{s, t\}$. This means that we seek to learn a discriminator such that $p(Y_{domain} = s \mid Z_s) = D(Z_s) \to 1$ and $p(Y_{domain} = s \mid Z_t) = D(Z_t) \to 0$. This is done in an adversarial fashion, similar to ADDA [29]. Further, we encourage the target latent representation, $Z_t$, to be clustered into $K_s$ components with centroids, $Z_c$, for $c = 1, \ldots, K_s$. The objective for the model illustrated in Figure 2 consists of five terms, namely, classification, $L_{cla}$, adversarial, $L_{adv}$, encoder, $L_{enc}$, clustering, $L_{dec}$, and dissimilarity, $L_{dis}$, which we describe below.

2.1. Classification objective 

Figure 2: Robust unsupervised domain adaptation architecture. $Y_s$ and $Y_d$ are source and domain labels, respectively. Latent features for source, target and target cluster centroids are denoted as $Z_s$, $Z_t$ and $Z_c$, respectively. Model blocks are represented as rectangles and losses as ellipses.

The supervised model consists of the source encoder $Z_s = E_s(X_s)$ and classifier $p(Y_s = k \mid Z_s) = C(Z_s)$, for $k = 1, \ldots, K_s$. These are pre-trained on $\{X_s, Y_s\}_{i=1}^{N_s}$ by maximizing the following objective

$$-L_{cla} = \mathbb{E}_{(x,y) \sim p(X_s, Y_s)} \left[ y^\top \log\{C(E_s(x))\} \right], \qquad (1)$$

where $y$ is the $K_s$-dimensional one-hot vector representation for $Y_s$, $C(Z_s)$ is assumed to have a softmax activation function and $p(X_s, Y_s)$ is the joint empirical distribution of the source. Once trained on the source dataset, both $E_s(X_s)$ and $C(Z_s)$ will be kept fixed during adaptation.
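
As an illustration of the classification objective in (1), the sketch below computes the minibatch average of $-y^\top \log C(E_s(x))$ in plain Python. It assumes that `batch_logits` holds the pre-softmax outputs of the classifier applied to encoded source inputs and that labels are given as class indices rather than one-hot vectors; both names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classification_loss(batch_logits, labels):
    """Minibatch estimate of L_cla: mean of -log p(Y_s = k | Z_s) at the true class k."""
    total = 0.0
    for logits, k in zip(batch_logits, labels):
        probs = softmax(logits)
        # y is one-hot, so y^T log(probs) picks out the log-probability of class k.
        total += -math.log(probs[k])
    return total / len(labels)
```

For uniform logits over $K_s$ classes the loss equals $\log K_s$, which is a convenient sanity check before pre-training.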

2.2. Adversarial objectives 

To minimize the impact of the variation caused by the differences between source and target domains, we utilize the standard adversarial objective, $L_{adv}$, to learn the discriminator $D(\cdot)$ by maximizing

$$-L_{adv} = \mathbb{E}_{x \sim p(X_s)} \log D(E_s(x)) + \mathbb{E}_{x \sim p(X_t)} \log(1 - D(E_t(x))), \qquad (2)$$

where $p(X_s)$ and $p(X_t)$ are the marginal empirical distributions for source and target, respectively. For the generator, we separately maximize over the target encoder via

$$-L_{enc} = \mathbb{E}_{x \sim p(X_t)} \log D(E_t(x)), \qquad (3)$$

where we have inverted the labels relative to (2), which has the same properties as the original min-max (adversarial) loss used in GAN but results in stronger gradients for the target encoder.

2.3. Clustering objective

Assuming that in latent space we have as many clusters, $K$, as distinct labels in the source domain, so $K = K_s$, we denote their centroids as $Z_c$, for $c = 1, \ldots, K_s$. Borrowing from Deep Unsupervised Embedding [32], we minimize the following Kullback-Leibler (KL) divergence

$$L_{dec} = KL(P \| Q) = \sum_{i=1}^{N} \sum_{c=1}^{K_s} p_{ic} \log \frac{p_{ic}}{q_{ic}}, \qquad (4)$$

where $Q$ and $P$ are the soft assignment and auxiliary distributions, respectively. For $Q$, we use soft assignments from a mixture of Student's t distributions with $\alpha$ degrees of freedom [21], written as

$$q_{ic} = \frac{(1 + \|Z_i - Z_c\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{c'} (1 + \|Z_i - Z_{c'}\|^2 / \alpha)^{-\frac{\alpha+1}{2}}},$$

from which $q_{ic}$ approximates the probability of instance $i$ being assigned to cluster $c$. We set $\alpha = 1$ in all experiments.

For $P$, the auxiliary distribution, we encourage cluster tightness by raising $q_{ic}$ to a power of 2 and normalizing accordingly, so

$$p_{ic} = \frac{q_{ic}^2 / f_c}{\sum_{c'} q_{ic'}^2 / f_{c'}}, \qquad (5)$$

where $f_c = \sum_i q_{ic}$. Note that (5) naturally results in a self-reinforcement mechanism that encourages latent features $Z_t$ to lie closer to the centroids, $\{Z_c\}_{c=1}^{K_s}$, of the mixture distribution. In the experiments, for the balanced and imbalanced cases, the $K_s$ cluster centroids, $Z_c$, are initialized to the mean of the latent representations of target instances that are predicted as class $c$ by $C(\cdot)$, for $c = 1, \ldots, K_s$, so that each cluster is identified with one of the labels in the source.

2.4. Cluster dissimilarity objective

One limitation of the clustering loss in (4) is that although it encourages clusters to be tight, it does not explicitly encourage them to be pure (consisting of members of the same class). Moreover, in some cases it may result in domain collapse, i.e., clusters of distinct classes being located arbitrarily close. To avoid these issues, we seek to match at the cluster level by encouraging that instances from different clusters are predicted as different classes. So motivated, we define $A = [C(Z_1) \ \ldots \ C(Z_{K_s})]$ as the $K_s \times K_s$ matrix whose columns contain the distribution of label predictions for the $K_s$ centroids, $Z_c$, using classifier $C(Z_c)$. Then define the cluster dissimilarity objective, $L_{dis}$, to be minimized as

$$L_{dis} = \|A^\top A - I\|_F, \qquad (6)$$

where $\|\cdot\|_F$ is the Frobenius norm. Under (6), entries of $A^\top A$ contain similarities between the class membership probabilities for all pairs of cluster centroids. Note that diagonal entries of $A^\top A$ will encourage columns of $A$ to have unit norm. As a result, columns of $A$, which represent probability vectors (positive and summing up to one), will be encouraged to become one-hot vectors. By minimizing (6) we encourage that the $K_s$ predicted class membership probability vectors are different and close to one-hot vectors, in which case each centroid will tend to be assigned to a different class with high probability. For the implementation, we can write

$$L_{dis} = \sum_{i=1}^{K_s} \sum_{j \neq i} a_i^\top a_j, \qquad (7)$$
Algorithm 1 Training with SGD.

Let $\{\theta_{E_s}, \theta_{E_t}, \theta_C, \theta_D, Z_c\}$ be the parameters for each model component.

Input:
    Source and target data: $\{X_s, Y_s\}$, $X_t$
    Learning rates $\{\gamma_{adv}, \gamma_{enc}, \gamma_{dec}, \gamma_{dis}\}$
    Batch size $M$
    Pre-training steps: $I_{adv}$

Output:
    Target encoder: $E_t(\cdot)$
    Classifier: $C(\cdot)$

Train a source model, $E_s(\cdot)$ and $C(\cdot)$, with $L_{cla}$
$i = 0$
while not converged do
    Draw random minibatches $\{X_s\}_{i=1}^{M}$, $\{X_t\}_{i=1}^{M}$
    if $i > I_{adv}$ then
        $\theta_{E_t} = \theta_{E_t} - \gamma_{dec} \nabla_{\theta_{E_t}} L_{dec}$
        $Z_c = Z_c - \gamma_{dec} \nabla_{Z_c} L_{dec}$
        $Z_c = Z_c - \gamma_{dis} \nabla_{Z_c} L_{dis}$
    end if
    $\theta_D = \theta_D - \gamma_{adv} \nabla_{\theta_D} L_{adv}$
    $\theta_{E_t} = \theta_{E_t} - \gamma_{enc} \nabla_{\theta_{E_t}} L_{enc}$
    $i = i + 1$
end while


where $a_c = C(Z_c)$ is a column of $A$ and, compared to (6), we have excluded the diagonal elements of $A^\top A$. We found empirically that excluding the diagonal terms stabilizes training, so this form is preferred in the experiments.
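
The clustering and cluster dissimilarity objectives of Sections 2.3 and 2.4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: latent features and centroids are lists of floats, `columns[c]` stands for the classifier output $C(Z_c)$, and all function names are hypothetical. The `include_diagonal=False` option is one assumed reading of the variant that drops the diagonal of $A^\top A$.

```python
import math

def soft_assignments(Z, centroids, alpha=1.0):
    """Student's t soft assignments q_ic of latent features to centroids."""
    Q = []
    for z in Z:
        weights = []
        for c in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(z, c))
            weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        Q.append([w / total for w in weights])
    return Q

def auxiliary_distribution(Q):
    """Sharpened targets p_ic: square q_ic, divide by f_c = sum_i q_ic, normalize."""
    K = len(Q[0])
    f = [sum(row[c] for row in Q) for c in range(K)]
    P = []
    for row in Q:
        weights = [row[c] ** 2 / f[c] for c in range(K)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def clustering_loss(P, Q):
    """KL(P || Q) summed over instances and clusters, as in (4)."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)

def dissimilarity_loss(columns, include_diagonal=True):
    """Frobenius norm of A^T A - I as in (6), where columns[c] = C(Z_c).

    With include_diagonal=False the diagonal entries are skipped
    (an assumption about the diagonal-free variant, not a confirmed formula).
    """
    K = len(columns)
    total = 0.0
    for i in range(K):
        for j in range(K):
            if i == j and not include_diagonal:
                continue
            gram = sum(x * y for x, y in zip(columns[i], columns[j]))
            total += (gram - (1.0 if i == j else 0.0)) ** 2
    return math.sqrt(total)

# Toy example (hypothetical): two 2-D features sitting exactly on two centroids.
Z = [[0.0, 0.0], [4.0, 0.0]]
centroids = [[0.0, 0.0], [4.0, 0.0]]
Q = soft_assignments(Z, centroids)
P = auxiliary_distribution(Q)
```

In the toy example the auxiliary distribution is sharper than the soft assignments, which is the self-reinforcement effect described for (5), and the dissimilarity loss vanishes when the centroid predictions are distinct one-hot vectors.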

2.5. Complete objective 

The proposed robust unsupervised adaptation approach proceeds by first optimizing $L_{cla}$ in (1) on the source data, $\{X_s, Y_s\}_{i=1}^{N_s}$. Then, with fixed source encoder and classifier, $E_s(X_s)$ and $C(Z_s)$, respectively, we perform domain adaptation by updating in sequence the discriminator, target encoder and cluster centroids, $\{Z_c\}_{c=1}^{K_s}$, using the following complete objective: $L = L_{adv} + L_{enc} + L_{dec} + L_{dis}$, whose four terms are defined in Sections 2.2, 2.3 and 2.4. Instead of specifying parameters in the complete objective to balance the different losses, we set different learning rates for each loss component as shown in Algorithm 1. In our experiments, $\gamma_{enc}$ and $\gamma_{adv}$ were set to the values specified originally in ADDA [29] and we further set $\gamma_{dis} = 2\gamma_{dec}$. We will show that the model is fairly insensitive to the choice of $\gamma_{dec}$. Note that during the adaptation, the source labels are not needed and the source instances are only used to update the discriminator. We found empirically that it is beneficial to train the model with only the adversarial objectives for several iterations ($I_{adv}$ between 0 and 150 in the experiments), to provide a good initialization point for the clustering loss.
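
The adversarial terms and the update schedule of Algorithm 1 can be illustrated with the short Python sketch below. It assumes that `d_source` and `d_target` hold discriminator outputs $D(\cdot) \in (0, 1)$ on encoded source and target minibatches; the function names and the string labels for the updates are hypothetical.

```python
import math

def adversarial_loss(d_source, d_target):
    """Minibatch estimate of L_adv in (2), minimized by the discriminator."""
    source_term = sum(math.log(d) for d in d_source) / len(d_source)
    target_term = sum(math.log(1.0 - d) for d in d_target) / len(d_target)
    return -(source_term + target_term)

def encoder_loss(d_target):
    """Minibatch estimate of L_enc in (3), minimized by the target encoder."""
    return -sum(math.log(d) for d in d_target) / len(d_target)

def active_updates(i, i_adv):
    """Losses used at iteration i, in the order of Algorithm 1.

    Clustering and dissimilarity updates start only after i_adv
    adversarial-only iterations.
    """
    updates = []
    if i > i_adv:
        updates += ["dec", "dis"]
    updates += ["adv", "enc"]
    return updates
```

For example, with `i_adv = 150` the first iterations use only the adversarial and encoder updates, and the clustering and dissimilarity updates are added from iteration 151 onward.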



Figure 3: Features before (a) and after (b) augmentation of the target set. The orange and blue items represent the source and target features. The dashed line is the boundary separating source and target learned by the discriminator. The arrows illustrate the direction of the adaptation. After the augmentation of the target inputs, the discriminator will set a softer boundary for the target domain. Since the target instances tend to transfer perpendicularly to the boundary via gradient updates, this modification can reduce negative transfer effects from the target toward outlier source classes.

2.6. Partial domain adaptation 

In order to initialize the cluster centroids, we need to first specify the number of clusters, $K$. In the setting considered above we simply let $K = K_s$, which is reasonable assuming source and target domains share the same label space. However, this is not viable in the partial setting because the true number of target classes, $K_t$, is unknown. If we set $K_t < K < K_s$, at least one of the clusters will be assigned to a label present in the source but not the target. Alternatively, if we set $K < K_t < K_s$, at least one of the target classes will not be assigned. Therefore, it seems cumbersome to guess the number of target classes.

Instead of attempting to estimate $K_t$, we propose a simple strategy consisting of augmenting the target set with a portion of source data, as shown in Figure 3. Specifically, when drawing a minibatch from the target to update the parameters of the model, $\{\theta_{E_t}, Z_c\}$ in Algorithm 1, we augment it with samples from the source, e.g., 50% of the minibatch is drawn from the target and 50% from the source (without using labels). In this way, we can ensure that the augmented target has samples from all classes from the source. As a result, the partial domain adaptation task is converted to a pseudo imbalanced domain adaptation problem such that $K = K_s = K_t$, which is appropriate for our formulation. It is worth noting that when augmenting the target set with source instances we do not use their label information. Further, in Figure 3 we see that augmenting the target data with (a subset of) the source has the potential benefit of making it more difficult for the discriminator to distinguish source and target instances, thus resulting in a softer discriminator boundary for the target that can help reduce the effects of negative transfer.
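
The target augmentation described above can be sketched as a minibatch sampler. This is a minimal illustration, assuming inputs are held in Python lists; the function name `draw_augmented_minibatch` and the `source_fraction` parameter are hypothetical, with the default of 0.5 matching the 50%/50% example in the text.

```python
import random

def draw_augmented_minibatch(target_inputs, source_inputs, batch_size,
                             source_fraction=0.5, rng=random):
    """Draw a 'target' minibatch that mixes target and unlabeled source inputs.

    Source labels are never touched: only source inputs are sampled.
    """
    n_source = int(round(batch_size * source_fraction))
    n_target = batch_size - n_source
    batch = rng.sample(target_inputs, n_target) + rng.sample(source_inputs, n_source)
    rng.shuffle(batch)  # avoid a fixed target-then-source ordering within the batch
    return batch

# Usage with placeholder inputs (strings stand in for images).
rng = random.Random(0)
targets = ["t%d" % i for i in range(20)]
sources = ["s%d" % i for i in range(20)]
batch = draw_augmented_minibatch(targets, sources, batch_size=8, rng=rng)
```

The resulting minibatch is then used wherever Algorithm 1 draws target samples for the target encoder and centroid updates.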
In this setting, the cluster centroids are initialized using the source data rather than the combination of source and target instances. In a real setting where we have no insight into whether the target contains all or a subset of the source classes, we can first treat it as imbalanced domain adaptation, then switch to partial domain adaptation if noticeable negative transfer is observed, e.g., by inspecting the t-SNE embedding of the learned latent representations.

3. Related Work 

Unsupervised domain adaptation in the balanced setting has been extensively studied. The general idea is to match the source and target marginal distributions directly or indirectly, the latter by matching their latent representations. [27, 28] proposed to match the moments of different encoding layers of the latent representations. This approach is easy to implement and achieves competitive results on several benchmarks. [18, 30] used the Maximum Mean Discrepancy (MMD) framework to implicitly measure the distance between source and target distributions. Specifically, they match the kernel embeddings of both distributions by minimizing their MMD. Further, [19] improved the MMD-based approaches by allowing separate classifiers for source and target domains.

Driven by the increasing popularity of the Generative Ad¬ 
versarial Networks (GANs) HD , recent adaptation methods 
resort to matching the distributions in an adversarial manner. 
I LSI 29]] added a discriminator to the latent representation to 
distinguish features from different domains, while the fea¬ 
ture encoders are trained to mislead the domain discriminator 
so it cannot find an effective boundary that distinguishes be¬ 
tween source and target instances. The domain discriminator 
and feature encoders are trained adversarially as a min-max 
objective. Inspired by (H, (25ll utilized the classifier dis¬ 
crepancy to detect target samples that are distant from the 
source. Instead of using a discriminator, they proposed to 
adversarially maximize the discrepancy between two source 
classifiers, while training a feature encoder to reduce the 
inconsistency of their predictions. 

The approaches described above rely on the assumption 
that source and target share the same label domain and dis¬ 
tribution. This assumption limits their applicability to situ¬ 
ations where these are violated, i.e., imbalanced or partial 
domain adaptation scenarios. d utilized the pairwise simi¬ 
larity information from the source to regularize the implicit 
clustering of the target domain and thus it has the potential 
to be used for the imbalanced scenario. However, their clus¬ 
tering on target domain is determined from the source only, 
thus it does not benefit from the local information provided 
by the target. BUD introduced the concept of partial do¬ 
main adaptation, in which target classes are assumed to be 
a subset of the source domain. They reduce the effect of negative transfer by filtering out classes not present in the target; however, their approaches achieve only moderate performance when the source and target label domains are the same. In our approach, we first transform the partial scenario into a special imbalanced setting via target domain augmentation, then we perform domain adaptation with our clustering-based objective without further changes.

4. Experiments 

We conduct experiments on three domain adaptation benchmark datasets: the digits datasets, Office31 and VisDA2017. The results below demonstrate that our method is robust to differences in source and target label distributions, producing state-of-the-art classification performance in all three scenarios considered. To quantify the impact of the newly introduced dissimilarity loss, we perform experiments with and without it, denoted as Ours and Ours (no $L_{dis}$), respectively. In order to validate the intuition illustrated in Figure 3, we also define ADDA-mix, which corresponds to standard ADDA with augmented target inputs as described in Section 2.6. The mixture rate for source and target is 1:1 for all partial domain adaptation experiments.

4.1. Datasets 

The digits datasets: We consider three digits datasets with varying difficulties: MNIST, SVHN and USPS, each containing 10 classes for digits 0-9. The encoder architecture for the digits images is the modified LeNet from [29]. For the domain classification, the adversarial discriminator consists of 3 fully connected layers with 500 hidden units for the first two layers and 2 units for the output. All images are converted to grayscale and rescaled to 28 × 28 pixels. We consider three directions of transfer: SVHN→MNIST, USPS→MNIST and MNIST→USPS.

Office31: This is a standard benchmark for domain adaptation widely used in computer vision. It consists of 4652 images from 31 classes. These images are collected from three distinct domains: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images taken by a web camera and a digital SLR camera, respectively, with different background settings. This is a relatively difficult dataset since the Webcam and DSLR domains contain very few images, i.e., fewer than 10 for some classes, which may easily lead to overfitting during the adaptation process.

In the experiments, we consider all six directions of adaptation: A→W, A→D, W→A, W→D, D→W and D→A. The architecture of the encoder for images in Office31 is a ResNet-50 [13] pre-trained on ImageNet. All images are first resized to 256 × 256 pixel RGB images, then randomly cropped during training and center cropped during testing into 224 × 224 RGB images. Due to the small size of Office31, we approach the task as fully transductive, where all labeled instances from the source and all unlabeled instances from the target are used during training and adaptation. This is the same as in the experiments in [18, 9, 29].

Figure 4: t-SNE plot of the feature domains for SVHN→MNIST. (a) ADDA (b) Ours

VisDA2017: This is a dataset for the Visual Domain Adaptation Challenge, from synthetic 2D renderings of 3D models to real images. It consists of 12 classes shared by both domains, each with a very large number of instances. We use ResNet-50 as the source and target encoders. Complementary to Office31, this dataset will validate the performance of our method on large-scale datasets.

4.2. Balanced domain adaptation 

The digit datasets: Experiments are conducted with all 10 digits in the balanced setting. Results are shown in Table 1. Our method outperforms all the other baselines in all three directions, which demonstrates the effectiveness of our method in the standard balanced setting with moderate to large datasets. Figure 4 shows the t-SNE embeddings of the target representation for SVHN→MNIST, from which it can be observed that results from ADDA are more entangled, while those from our method are more concentrated and discriminative, thanks to the imposed clustering structure. Further, if we do pure DEC clustering on MNIST without marginal domain matching, i.e., without $L_{adv}$ and $L_{enc}$, the resulting clustering accuracy according to [32] is 85%, which is substantially lower than the accuracies with domain matching from SVHN→MNIST and USPS→MNIST (see Table 1). These results highlight the benefits of jointly doing marginal domain matching and discriminative clustering.
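
The unsupervised clustering accuracy referred to above scores a clustering by the best one-to-one mapping between cluster indices and ground-truth labels. The sketch below is a minimal illustration that assumes a brute-force search over all mappings, which is only practical for a small number of classes; the function name is hypothetical, and practical implementations typically solve the same assignment problem with the Hungarian algorithm instead.

```python
from itertools import permutations

def clustering_accuracy(y_true, cluster_ids, num_classes):
    """Accuracy under the best one-to-one mapping from clusters to labels."""
    # counts[c][k] = number of instances in cluster c whose true label is k.
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, c in zip(y_true, cluster_ids):
        counts[c][t] += 1
    # Brute force over all cluster-to-label assignments (illustrative only).
    best = max(
        sum(counts[c][perm[c]] for c in range(num_classes))
        for perm in permutations(range(num_classes))
    )
    return best / len(y_true)
```

Because the mapping is optimized, the score is invariant to how the clusters happen to be numbered.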

Office31: For the balanced setting with Office31 data, we used all 31 classes in the three domains. It is difficult to make the adaptation converge, especially when the small domains (W and D) are used as target. In our implementation, we set the initial learning rates for the discriminator and target encoder to 1e-3 and 1e-5, respectively. The learning rates are divided by 10 every 100 and 200 iterations, respectively, with a batch size of 64. Table 2 shows the results for balanced domain adaptation on Office31. Our method outperforms ADDA, DANN and DAN in all transfer directions, which validates its effectiveness on small datasets in the balanced adaptation scenario.

Table 1: Balanced domain adaptation on the digits datasets.

| Method | SVHN→MNIST | USPS→MNIST | MNIST→USPS |
|---|---|---|---|
| Source | 0.598 | 0.634 | 0.771 |
| DANN [9] | 0.746 | 0.909 | 0.880 |
| ADDA [29] | 0.760 | 0.901 | 0.894 |
| DIFA | 0.897 | 0.897 | 0.923 |
| MCD [25] | 0.962 | 0.941 | 0.942 |
| Adversarial Dropout [24] | 0.950 | 0.931 | 0.932 |
| Ours (no $L_{dis}$) | 0.965 | 0.976 | 0.943 |
| Ours | 0.965 | 0.979 | 0.952 |

Table 2: Balanced domain adaptation on Office31.

| Method | A→W | A→D | W→A | W→D | D→W | D→A |
|---|---|---|---|---|---|---|
| Source | 0.629 | 0.604 | 0.477 | 0.982 | 0.951 | 0.504 |
| DAN [18] | 0.685 | 0.670 | 0.531 | 0.990 | 0.960 | 0.540 |
| DANN [9] | 0.730 | 0.719 | 0.535 | 0.992 | 0.964 | 0.501 |
| ADDA [29] | 0.751 | 0.677 | 0.573 | 0.996 | 0.970 | 0.525 |
| Ours (no $L_{dis}$) | 0.787 | 0.743 | 0.535 | 0.996 | 0.979 | 0.532 |
| Ours | 0.810 | 0.727 | 0.595 | 0.998 | 0.979 | 0.553 |

4.3. Imbalanced domain adaptation 

We conduct an experiment for imbalanced domain adaptation on MNIST→USPS by manually sampling an imbalanced target domain. From 0 to 9, the ratio of classes linearly decreases from 1 to 0.3 on USPS (target) and is kept uniform for MNIST (source). This means that, for the sampled USPS, if there are 10 images of 0s, then there are only 3 images of 9s. The experiment is meant to illustrate the effects of negative transfer in the imbalanced setting and the ability of our method to alleviate the negative transfer caused by the class imbalance.
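
The imbalanced target sampling described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the per-class keep ratios are taken to be equally spaced between 1 and 0.3, subsampling is done uniformly at random within each class, and the function names are hypothetical.

```python
import random

def class_keep_ratios(num_classes=10, first=1.0, last=0.3):
    """Keep ratios decreasing linearly from `first` (class 0) to `last` (last class)."""
    step = (first - last) / (num_classes - 1)
    return [first - k * step for k in range(num_classes)]

def subsample_indices(labels, ratios, rng):
    """Indices of a subsample that keeps a fraction ratios[y] of each class y."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    kept = []
    for y, idxs in by_class.items():
        n_keep = int(round(ratios[y] * len(idxs)))
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)

# Toy target set (hypothetical): 10 instances for each of the 10 digit classes.
labels = [digit for digit in range(10) for _ in range(10)]
ratios = class_keep_ratios()
kept = subsample_indices(labels, ratios, random.Random(0))
kept_labels = [labels[i] for i in kept]
```

With 10 instances per class, this keeps all 10 instances of digit 0 and 3 instances of digit 9, matching the example given in the text.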

Table 3 shows accuracies on the target domain (USPS) for each class. For ADDA and DANN, we should note that, although there is no obvious difference in the overall accuracies compared with the source, the large classes (0 and 1) are degraded during the adaptation due to the negative transfer illustrated in Figure 1. On the contrary, our method is robust against the imbalance and results in very high accuracies for most of the classes, especially 0s and 1s. Further, it is worth noting that the overall accuracy of our method decreases by only 0.017 (1.7 percentage points) compared to the balanced scenario shown in Table 1.

The target representations for the four models are plotted in Figure 5. For ADDA and DANN, it is clearly shown that the large classes, e.g., purple and dark blue for 0s and 1s, are negatively transferred toward other smaller classes when compared with the source model, while the target representation from our approach is more discriminative and better clustered.
Figure 5: t-SNE embeddings of the target representations for MNIST→USPS in the imbalanced setting.


Table 3: Imbalanced domain adaptation for MNIST→USPS.

| Method | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | 0.816 | 0.962 | 0.874 | 0.663 | 0.860 | 0.844 | 0.888 | 0.857 | 0.747 | 0.051 | 0.771 |
| DANN [9] | 0.485 | 0.636 | 0.859 | 0.904 | 0.865 | 0.956 | 0.947 | 0.590 | 0.916 | 0.548 | 0.767 |
| ADDA [29] | 0.493 | 0.640 | 0.874 | 0.874 | 0.795 | 0.963 | 0.912 | 0.939 | 0.910 | 0.825 | 0.781 |
| Ours (no $L_{dis}$) | 0.986 | 0.973 | 0.980 | 0.934 | 0.875 | 0.944 | 0.947 | 0.959 | 0.946 | 0.695 | 0.932 |
| Ours | 0.989 | 0.977 | 0.929 | 0.904 | 0.795 | 0.963 | 0.965 | 0.932 | 0.964 | 0.876 | 0.935 |


4.4. Partial domain adaptation 

Office31: We select the 10 classes shared by Office31 and Caltech-256 as our target labels. For each direction of adaptation, we use all the images of these 10 classes in the target split as the target domain (denoted as A10, W10, D10), and images from all 31 classes in the source split as the source domain (denoted as A31, W31, D31). As described in Section 2.6, we first convert the partial domain adaptation task into a pseudo imbalanced setting by augmenting the target with (unlabeled) source data, then normal domain adaptation is conducted as before. In this setting, the cluster centroids are initialized using the source data instead of all source and target instances. We also set $I_{adv} = 0$ in Algorithm 1, since training with raw adversarial objectives is likely to degrade the performance.

The results on Office31 are presented in Table 5. SAN and PADA are two of the first approaches specifically designed for partial domain adaptation. The results suggest that our method outperforms SAN by a large margin and is competitive with PADA, which was proposed very recently. Combined with the experiments on Office31 in the balanced setting, these experiments validate the robustness of our method in learning discriminative target representations from small target datasets such as Office31.

VisDA2017: The VisDA2017 dataset was originally used
for the balanced domain adaptation setting with a shared label
domain and distribution. Following 0, we keep only
images of the first 6 classes in alphabetical order in the target
domain (REAL-6, SYN-6), while all the images of the 12
classes are kept in the source domain (REAL-12, SYN-12).
The data preprocessing and experimental settings are the same
as above for Office31, except that we use ResNet-50 with a
12-dimensional output instead of 31.


Table 4: Partial domain adaptation on VisDA2017.

Method               SYN-12→REAL-6  REAL-12→SYN-6
Source               0.421          0.568
DANN [9]             0.327          0.605
ADDA [29]            0.545          0.562
ADDA-mix             0.543          0.605
PADA [2]             0.535          0.765
Ours (no $L_{dis}$)  0.682          0.709
Ours                 0.700          0.846
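The class selection for the partial VisDA2017 setting can be sketched as
follows. The class names are written from memory of the dataset's 12
categories and should be treated as an assumption, as should the
hypothetical helper names and the (path, class name) sample format.

```python
# Assumed names of the 12 VisDA2017 classification categories.
VISDA2017_CLASSES = [
    "aeroplane", "bicycle", "bus", "car", "horse", "knife",
    "motorcycle", "person", "plant", "skateboard", "train", "truck",
]

def partial_target_classes(class_names, num_keep=6):
    """Keep the first `num_keep` classes in alphabetical order."""
    return sorted(class_names)[:num_keep]

def filter_by_class(samples, keep):
    """Keep only (path, class_name) pairs whose class is in `keep`."""
    keep = set(keep)
    return [(path, name) for path, name in samples if name in keep]

target_classes = partial_target_classes(VISDA2017_CLASSES)

# Toy sample list: the source keeps all 12 classes, the target only 6.
samples = [("img_0.jpg", "car"), ("img_1.jpg", "truck"), ("img_2.jpg", "bus")]
target_samples = filter_by_class(samples, target_classes)
```

The source classifier still has a 12-dimensional output, so half of its
classes have no counterpart in the target domain.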

The results for SYN-12→REAL-6 and REAL-12→SYN-6
are shown in Table 4. We see that our method outperforms
PADA and the source-only baseline by a large margin, which
further demonstrates the effectiveness of our approach on
large-scale datasets in the partial setting.

4.5. Ablation test 

In order to demonstrate empirically the impact of the
dissimilarity loss on performance, we considered throughout
all experiments an ablation test for our model with (Ours)
and without (Ours (no $L_{dis}$)) the dissimilarity loss. Note
also that our model without both clustering and dissimilarity
losses reduces to standard ADDA and, in partial domain
adaptation, to ADDA-mix. In general, we observed that our
approach with and without dissimilarity loss consistently
outperforms ADDA and ADDA-mix, while ADDA-mix is
generally better than ADDA. Further, the dissimilarity loss
often results in performance gains relative to the model
without it. These results suggest that training with $L_{dec}$
and $L_{dis}$ indeed helps in producing more discriminative
target feature spaces. They also demonstrate that the augmented
target in partial adaptation can reduce the negative transfer
when the source and target label domains are different, as
described in Section [221.

Table 5: Partial domain adaptation on Office31.

Method               A31→W10  A31→D10  W31→A10  W31→D10  D31→W10  D31→A10
Source               0.664    0.701    0.691    0.968    0.980    0.690
DANN [9]             0.498    0.529    0.496    0.624    0.314    0.468
ADDA [29]            0.593    0.675    0.727    0.764    0.705    0.686
ADDA-mix             0.610    0.713    0.722    0.949    0.905    0.707
SAN [4]              0.800    0.813    0.831    1.000    0.986    0.806
PADA [3]             0.865    0.822    0.954    1.000    0.993    0.927
Ours (no $L_{dis}$)  0.834    0.834    0.944    1.000    0.902    0.886
Ours                 0.834    0.847    0.943    1.000    0.997    0.928

Figure 6: Accuracy on target testing data with different
values of $\gamma_{dec}$.
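The ablation comparison can be checked directly against the Office31
numbers reported in Table 5. The sketch below is only a convenience for
reading the table: the accuracies are copied from the table, while the
helper names and the choice of mean accuracy and per-task win counts as
summaries are assumptions of this example.

```python
# Accuracies from Table 5, in the order:
# A31->W10, A31->D10, W31->A10, W31->D10, D31->W10, D31->A10.
TABLE5 = {
    "ADDA":            [0.593, 0.675, 0.727, 0.764, 0.705, 0.686],
    "ADDA-mix":        [0.610, 0.713, 0.722, 0.949, 0.905, 0.707],
    "Ours (no L_dis)": [0.834, 0.834, 0.944, 1.000, 0.902, 0.886],
    "Ours":            [0.834, 0.847, 0.943, 1.000, 0.997, 0.928],
}

def mean_accuracy(scores):
    """Unweighted mean accuracy over the six adaptation tasks."""
    return sum(scores) / len(scores)

def wins(a, b):
    """Number of tasks on which method `a` is strictly better than `b`."""
    return sum(x > y for x, y in zip(a, b))

means = {name: round(mean_accuracy(s), 3) for name, s in TABLE5.items()}
full_vs_mix = wins(TABLE5["Ours"], TABLE5["ADDA-mix"])
dis_gain = wins(TABLE5["Ours"], TABLE5["Ours (no L_dis)"])
mix_vs_adda = wins(TABLE5["ADDA-mix"], TABLE5["ADDA"])
```

Win counts complement the means by showing whether a variant helps on
every task or only on a subset of them.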

4.6. Learning rates for clustering 

Our approach improves the discriminative ability of the
target representation by encouraging target clustering via
$L_{dec}$ and $L_{dis}$. However, by doing so we introduce
additional complexity and tuning requirements. Therefore, we
conduct a sensitivity analysis for $\gamma_{dec}$ and
$\gamma_{dis}$ on the digit datasets with settings similar to
those of Section 4.2.

For this experiment, all other hyperparameters of the
model are fixed and set to those in ADDA. The relation
between $\gamma_{dec}$ and $\gamma_{dis}$ is set to
$\gamma_{dis} = 2\gamma_{dec}$, as previously described. We do
so because the cluster centers are updated with both $L_{dec}$
and $L_{dis}$, but we want $L_{dis}$ to dominate the update to
promote class-dissimilar clusters and avoid domain collapse.
We consider all three adaptation directions, i.e.,
SVHN→MNIST, USPS→MNIST and MNIST→USPS.
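The coupling between the two clustering learning rates can be sketched as
below. This is an illustration under stated assumptions, not the authors'
code: the function names are hypothetical, the log-spaced grid and its
size are choices of this example (the default bounds follow the range
tested in this section), and the centroid update is written as a plain
gradient step with placeholder gradients.

```python
import numpy as np

def clustering_lr_grid(low=1e-5, high=7e-3, num=8, ratio=2.0):
    """Log-spaced gamma_dec values, each paired with
    gamma_dis = ratio * gamma_dec (ratio = 2 as stated in the text)."""
    return [(float(g), ratio * float(g)) for g in np.geomspace(low, high, num)]

def centroid_step(centroids, grad_dec, grad_dis, gamma_dec, gamma_dis):
    """One gradient step on the centroids using both clustering losses.
    With gamma_dis > gamma_dec, the L_dis gradient is weighted more."""
    return centroids - gamma_dec * grad_dec - gamma_dis * grad_dis

grid = clustering_lr_grid()

# Toy update with placeholder gradients of equal magnitude.
centroids = np.zeros((10, 4))
grad_dec = np.ones((10, 4))
grad_dis = np.ones((10, 4))
updated = centroid_step(centroids, grad_dec, grad_dis, 0.1, 0.2)
```

With gradients of equal magnitude, two thirds of the step in this toy
update comes from the dissimilarity term, which is the sense in which
$L_{dis}$ dominates.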

The resulting test accuracies for different values of
$\gamma_{dec}$ are shown in Figure 6. For MNIST→USPS, we find
that the target accuracy is not sensitive to the clustering
learning rates over the tested range, i.e., [1e-5, 7e-3]. This
suggests that, with small $\gamma_{dec}$, though the algorithm
can take longer to converge (results not shown), the model will
converge to a reasonably good solution. Alternatively, with
large $\gamma_{dec}$, the clustering strategy alone is able to
produce a discriminative target representation. On the other
hand, for USPS→MNIST and SVHN→MNIST, the performance drops
dramatically when $\gamma_{dec}$ is too large. Nevertheless, we
can still outperform ADDA by a large margin (see Table 1),
as long as the selected learning rates are not too extreme.

5. Conclusion 

We proposed a new domain adaptation method that extends
the ability of existing adaptation approaches based on
distribution matching to imbalanced and partial scenarios.
Our method improves the discriminability of target
representations by simultaneously learning tightly clustered
target embeddings and by encouraging each cluster to be assigned
to a unique and different class from the source. These criteria
guarantee the robustness of the proposed method against
differences between the source and target label distributions,
which relaxes the common assumption that source and target
share the same label domain and distribution. We focused
on three scenarios, namely, balanced domain adaptation,
imbalanced domain adaptation and partial domain adaptation.
Experiments on several benchmark datasets demonstrated
the effectiveness of the method in all three scenarios,
achieving state-of-the-art performance.

As future work, we are interested in extending our approach
to semi-supervised domain adaptation and segmentation
adaptation, in which we will seek to cluster pixels from the
same segment in the target domain in feature space and to
place them near source features of the same class.
This is of particular interest in medical imaging, where
segmentation is very expensive and time-consuming. Further,
the clustering objective used in our approach relies on
a reasonable initialization of the centroids. Additional work
is needed on more robust clustering procedures that are less
sensitive to centroid initialization.